-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix wrong calcuration of retry interval and detecting retry limit #3649
Conversation
We are now checking both actual retry process & calculating max retry timeout... |
Based on the current behavior, the documentation is wrong, and the previous patch is more correct. The current retry behavior is strange and makes this problem more difficult. I looked at the implementation and checked the behavior.
So However, it is questionable whether this retry behavior is correct in the first place.
|
Thanks. I'll also check validity of current implementation & behaviour. |
I used the following <match test.**>
@type stdout
<buffer>
flush_mode interval
flush_interval 1
retry_wait 3
retry_randomize false
retry_max_times 3
</buffer>
</match> I changed def write(chunk)
$log.info("!!!FLUSH!!!")
raise "For Debug"
end Then I got the following fluentd log.
|
I think we should fix the logic of diff --git a/lib/fluent/plugin/output.rb b/lib/fluent/plugin/output.rb
index 54dbbeb5..4056f7af 100644
--- a/lib/fluent/plugin/output.rb
+++ b/lib/fluent/plugin/output.rb
@@ -1284,20 +1284,6 @@ module Fluent
end
return
end
- end
-
- # @retry exists
-
- if @retry.limit?
- if error
- records = @buffer.queued_records
- msg = "failed to flush the buffer, and hit limit for retries. dropping all chunks in the buffer queue."
- log.error msg, retry_times: @retry.steps, records: records, error: error
- log.error_backtrace error.backtrace
- end
- @buffer.clear_queue!
- log.debug "buffer queue cleared"
- @retry = nil
else
# Ensure that the current time is greater than or equal to @retry.next_time to avoid the situation when
# @retry.step is called almost as many times as the number of flush threads in a short time.
@@ -1306,7 +1292,7 @@ module Fluent
else
@retry.recalc_next_time # to prevent all flush threads from retrying at the same time
end
- if error
+ if error && [email protected]?
if using_secondary
msg = "failed to flush the buffer with secondary output."
log.warn msg, retry_times: @retry.steps, next_retry_time: @retry.next_time.round, chunk: chunk_id_hex, error: error
@@ -1318,6 +1304,18 @@ module Fluent
end
end
end
+
+ if @retry.limit?
+ if error
+ records = @buffer.queued_records
+ msg = "failed to flush the buffer, and hit limit for retries. dropping all chunks in the buffer queue."
+ log.error msg, retry_times: @retry.steps, records: records, error: error
+ log.error_backtrace error.backtrace
+ end
+ @buffer.clear_queue!
+ log.debug "buffer queue cleared"
+ @retry = nil
+ end
end
end
diff --git a/lib/fluent/plugin_helper/retry_state.rb b/lib/fluent/plugin_helper/retry_state.rb
index 131f9eb6..7a7f2b22 100644
--- a/lib/fluent/plugin_helper/retry_state.rb
+++ b/lib/fluent/plugin_helper/retry_state.rb
@@ -158,7 +158,7 @@ module Fluent
def calc_max_retry_timeout(max_steps)
result = 0
- (0..max_steps).each { |i|
+ max_steps.times { |i|
result += calc_interval(i)
}
result
|
Yea, it's also needed. In addition, some comments in my commits and some tests are still wrong. I'm now addressing them too. |
The correct total retry time should be: c + c * b^1 + (...) + c*b^(k - 1) But the previous implementation was: c + c * b^0 + c * b^1 + (...) + c*b^(k - 1) where: * c: constant factor, @retry_wait * b: base factor, @retry_exponential_backoff_base * k: number of retry times, @max_retry_times Although the first retry interval is already added at constructor of ExponentialBackoffRetry, same value is added again unexpectedly on the first calc_interval call (as `c * b^0`). Signed-off-by: Takuro Ashie <[email protected]>
Hmm, this patch breaks |
The previous implementation detects @max_retry_times + 1 as limit, but the limit should be @max_retry_times. Signed-off-by: Takuro Ashie <[email protected]>
Signed-off-by: Takuro Ashie <[email protected]>
Signed-off-by: Takuro Ashie <[email protected]>
I have created PR to add scenario tests: #3685 |
RetryStatement behavior is complicated, so scenario tests may be useful. Signed-off-by: Daijiro Fukuda <[email protected]>
As for timeout, how about managing the status this way? This diff make some tests fail, I'm checking it. diff --git a/lib/fluent/plugin_helper/retry_state.rb b/lib/fluent/plugin_helper/retry_state.rb
index 29bcf8c1..8816773d 100644
--- a/lib/fluent/plugin_helper/retry_state.rb
+++ b/lib/fluent/plugin_helper/retry_state.rb
@@ -44,6 +44,8 @@ module Fluent
@timeout = timeout
@timeout_at = @start + timeout
+ @has_reached_timeout = false
+ @has_timeouted = false
@current = :primary
if randomize_width < 0 || randomize_width > 0.5
@@ -123,7 +125,15 @@ module Fluent
@current = :secondary
@secondary_transition_steps = @steps
end
+
@next_time = calc_next_time
+
+ unless @has_reached_timeout
+ @has_reached_timeout = @next_time >= @timeout_at
+ else
+ @has_timeouted = @next_time >= @timeout_at
+ end
+
nil
end
@@ -131,24 +141,11 @@ module Fluent
@next_time = calc_next_time
end
- # Use @next_time for time by default to keep backward compatibility
- def limit?(time: @next_time, steps: @steps)
- timeout?(time) || limit_step?(steps)
- end
-
- def timeout?(time = current_time)
- if @forever
- false
- else
- time >= @timeout_at
- end
- end
-
- def limit_step?(steps = @steps)
+ def limit?
if @forever
false
else
- !!(@max_steps && steps >= @max_steps)
+ @has_timeouted || !!(@max_steps && @steps >= @max_steps)
end
end
end
|
Have to fix scenario test to test this diff. diff --git a/test/plugin_helper/test_retry_state.rb b/test/plugin_helper/test_retry_state.rb
index 824151c3..f44f069c 100644
--- a/test/plugin_helper/test_retry_state.rb
+++ b/test/plugin_helper/test_retry_state.rb
@@ -988,15 +988,9 @@ class RetryStateHelperTest < Test::Unit::TestCase
msg << "[#{next_elapsed}s elapsed point] #{retry_count}th-Retry(#{s.secondary? ? "SEC" : "PRI"}) is triggered.\n"
# Update retry statement
- if s.limit?
- msg << "--- Reach limit of timeout. ---\n"
- break
- end
-
s.step
-
- if s.limit_step?
- msg << "--- Reach limit of max step. ---\n"
+ if s.limit?
+ msg << "--- Reach limit. ---\n"
break
end
end |
I have created PR: #3686 I have confirmed the following test success.
|
Add 2 states to control timeout. Signed-off-by: Daijiro Fukuda <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
Thank you for the merge! |
It's complicated how the behavior changes in the end.
|
Signed-off-by: Takuro Ashie <[email protected]>
See also: fluent/fluentd#3649 Before the fix and fluent/fluentd#3640 , the both of retry behavior and the document were wrong. Actual behavior: `c + c + cb^1 + ... + cb^(k-1)` (k+1 retries totally) Total calculation: `cb^(-1) + c + cb^1 + ... + cb^(k-2)` Signed-off-by: Daijiro Fukuda <[email protected]>
I have created the document fix: fluent/fluentd-docs-gitbook#395 |
Cool! Now the logic of |
See also: fluent/fluentd#3649 Before the fix, the number was mistakenly 19. This fixes it to 18. (The document was also wrong.) n-th retry is triggered at `2^n - 1` seconds elapsed since the first flush. So 18-th retry is at 262143 seconds elapsed and this exceeds 72 hours (259200 seconds). If the next retry is going to exceed this time limit, the last retry will be made at exactly this time limit. So the last time is 18. Signed-off-by: Daijiro Fukuda <[email protected]>
Which issue(s) this PR fixes:
Fixes #3609
What this PR does / why we need it:
The correct total retry time should be:
But the previous implementation was:
where:
Although the first retry interval is already added at constructor of
ExponentialBackoffRetry, same value is added again unexpectedly on the
first calc_interval call (as
c * b^0
).In addition, this PR also fixes wrong detection of retry limit.
Docs Changes:
https://docs.fluentd.org/configuration/buffer-section#retries-parameters is also wrong, should be fixed too.
Release Note:
Same with the title